Abstract

We investigate models for learning the class of context-free and 
context-sensitive languages (CFLs and CSLs). We begin with a brief 
discussion of some early hardness results which show that unrestricted 
language learning is impossible, and unrestricted CFL learning is compu- 
tationally infeasible; we then briefly survey the literature on algorithms 
for learning restricted subclasses of the CFLs. Finally, we introduce a 
new family of subclasses, the principled parametric context-free grammars 
(and a corresponding family of principled parametric context-sensitive 
grammars), which roughly model the "Principles and Parameters" frame- 
work in psycholinguistics. We present three hardness results: first, that 
the PPCFGs are not efficiently learnable given equivalence and member- 
ship oracles, second, that the PPCFGs are not efficiently learnable from 
positive presentations unless P = NP, and third, that the PPCSGs are 
not efficiently learnable from positive presentations unless integer factor- 
ization is in P. 
1 Introduction



A great deal of modern psycholinguistics has concerned itself with resolving the
problem of the so-called "poverty of the stimulus": the claim that natural languages
are unlearnable given only the data available to infants, and consequently
that some part of syntax must be "native" (i.e. prespecified) rather than learned.
Gold's theorem (described below), which states that there exists a superfinite
class of languages which is not learnable in the limit from positive presentations,
is often offered as proof of this fact (though the extent to which the theorem is
psycholinguistically informative remains a contentious issue) [gor90].

But how is innate linguistic knowledge represented? One mechanism usually 
offered is the Chomskian "Principles and Parameters" framework [cho93], which
suggests that there is a set of universal principles of grammar which inhere in 
the structure of the brain. In this framework, the process of language learn- 
ing simply consists of determining appropriate settings for a finite number of 
parameters which determine how those principles are applied. 

While this problem is generally supposed to be easier than unrestricted lan- 
guage learning, we are not aware of any previous work specifically aimed at 
studying the Principles and Parameters model in a computational setting. In 
this report, we introduce a family of subclasses of the context-free languages 
which we believe roughly captures the intuition behind the Principles and Pa- 
rameters model, and explore the difficulty of learning that model in various 
learning environments. 

We begin by presenting an extremely brief survey of the existing literature 
on the hardness of language learning; we then introduce three hardness results, 
one unconditional, one complexity-theoretic and one cryptographic, which sug- 
gest that the existence of a generalized algorithm for learning in the principles 
and parameters framework is highly unlikely. While we obviously cannot pro- 
duce any psychologically definitive results in this setting, we at least hope to 
challenge the notion that the Principles and Parameters framework is somehow 
a computationally satisfying explanation of the language learning process. 

2 Background 

2.1 Definitions 

2.1.1 Learnability in the limit 

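Gold [gol67] defines the language learning problem as follows:

Definition 1. Given a class of languages $C$ and an algorithm $A$, we say $A$
identifies $C$ in the limit from positive presentations if $\forall L \in C$, $\forall i_1, i_2, \ldots \in L$,
there is a time $t$ such that for all $u > t$, $h_u = h_t = A(i_1, \ldots, i_t)$, where $h_t$ denotes the hypothesis $A$ outputs after seeing the first $t$ examples and $h_t$ is a representation of $L$.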
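As a concrete illustration of Definition 1, the following is a minimal sketch of a learner that identifies any finite language in the limit by guessing exactly the set of strings seen so far. The names `learner`, `hypotheses`, `target` and `presentation` are illustrative assumptions, and languages are represented directly as finite sets of strings rather than as grammars.

```python
from typing import FrozenSet, Iterable, List


def learner(examples: List[str]) -> FrozenSet[str]:
    """Hypothesis after seeing `examples`: exactly the strings seen so far."""
    return frozenset(examples)


def hypotheses(presentation: Iterable[str]) -> List[FrozenSet[str]]:
    """Run the learner on every prefix of a positive presentation."""
    seen: List[str] = []
    out: List[FrozenSet[str]] = []
    for s in presentation:
        seen.append(s)
        out.append(learner(seen))
    return out


# A finite target language and a (finite prefix of a) positive presentation of it.
target = frozenset({"a", "ab", "abb"})
presentation = ["a", "ab", "a", "abb", "ab", "a", "abb"]

hs = hypotheses(presentation)
# Index of the first hypothesis equal to the target; it never changes afterwards.
t = next(i for i, h in enumerate(hs) if h == target)
stable = all(h == target for h in hs[t:])
```

Once every string of the finite target has appeared, the hypothesis stops changing, which is the convergence behaviour the definition asks for. Gold's theorem below shows that this strategy cannot be extended to a class that also contains an infinite limit language.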
2.1.2 Exact identification using queries

Modeling the language learning process as being entirely dependent on positive
examples seems rather extreme; it's useful to consider environments in which the
learner has access to a richer representation of the language. Angluin [ang90]
describes a model of language learnability from oracle queries, as follows:

Definition 2. An equivalence oracle for a language $L$ takes as input the
representation $r(L^*)$ of a language $L^*$ and outputs "true" if $L = L^*$, or some $w \in L \,\triangle\, L^*$
(the symmetric difference of the languages) otherwise. There is an obvious
equivalence, first pointed out by Littlestone [lit88], between the equivalence
query model and the online mistake bound model.

Definition 3. A membership oracle for a language L with start symbol S 
takes a string w, and outputs true if S =>* w and false otherwise. 

Definition 4. A nonterminal membership oracle for a language L takes 
a string w (not necessarily in L) and a nonterminal A, and outputs whether 
A =>* w (i.e. whether the set of possible derivations with A as a start symbol 
includes w). 

Definition 5. A class of languages C is learnable from an equivalence 
oracle (or analogously from an equivalence oracle and a membership oracle, 
sometimes referred to as a "minimal adequate teacher" ) if there exists a learning 
algorithm with runtime polynomial in the size of the representation of the class 
and length of the longest counterexample. 
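The oracle definitions above can be sketched for the simplest possible case, in which a language is represented as a finite set of strings rather than as a grammar. This is only an illustration under that assumption; `make_oracles`, `equivalence` and `membership` are hypothetical names, and the rule for picking a counterexample (shortest, then alphabetical) is an arbitrary choice.

```python
from typing import Callable, Optional, Set, Tuple


def make_oracles(target: Set[str]) -> Tuple[
    Callable[[Set[str]], Tuple[bool, Optional[str]]],
    Callable[[str], bool],
]:
    """Build an equivalence oracle and a membership oracle for a finite language."""

    def equivalence(hypothesis: Set[str]) -> Tuple[bool, Optional[str]]:
        diff = target ^ hypothesis  # symmetric difference of the two languages
        if not diff:
            return True, None
        # Any element of the symmetric difference is a valid counterexample.
        return False, min(diff, key=lambda w: (len(w), w))

    def membership(w: str) -> bool:
        return w in target

    return equivalence, membership


eq, mem = make_oracles({"a", "ab"})
```

Together the two functions play the role of a minimal adequate teacher for this toy representation.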

2.2 Hardness of language learning 

Theorem (Gold). There exists a class of languages not learnable in the limit 
from positive presentations. 

Proof sketch. Construct an infinite sequence of languages $L_1 \subset L_2 \subset \cdots$, all
finite, and let $L_\infty = \bigcup_i L_i$. Suppose there existed some algorithm $A$ that
could identify each $L_i$ from positive presentations. Then there is a positive
presentation of $L_\infty$ that causes $A$ to make an infinite number of mistakes. First
present a set of examples, all in $L_1$, that force $A$ to identify $L_1$. Then present a
set of examples forcing it to identify $L_2$, then $L_3$, and so on. An infinite number
of mistakes can be forced in this way, so $L_\infty$ is not learnable in the limit. ∎

While space does not permit us to discuss the proof here, we also note the 
following important result for CFL learning: 

Theorem (Angluin [ang80]). There exists a class of context-free languages with
"natural" representations which are not learnable from equivalence queries in 
time polynomial in the size of the representation. 
2.3 Learnable subclasses of the CFLs



While this last result rules out the possibility of a general algorithm for learning
CFLs, subsets of the CFLs have been shown to be learnable when given slightly
more powerful oracles. These include simple deterministic languages [ish90],
one-counter languages [ber87] and so-called very simple languages [yok91]. Particularly
heartening is Angluin's result that $k$-bounded CFGs can be learned in
polynomial time if nonterminal membership queries are permitted [ang87].

3 Principled Parametric Grammars 

We now introduce a formal model of the "principles and parameters" framework 
described in the introduction. 

3.1 Motivation 

Before moving on to the details of the construction, it's useful to consider a few 
example "principles" and "parameters" suggested by proponents of the model. 

• The pro-drop parameter: does this language allow pronoun dropping?
If PNP is a non-terminal symbol designating a pronoun, this parameter determines
whether or not a rule of the form $\mathrm{PNP} \to \varepsilon$ exists in the language.

• The ergative/nominative parameter: ergative languages distinguish
between transitive and intransitive senses of a verb by marking the subject,
while nominative languages (like English) mark the object. Let NP and
VP be non-terminal symbols for noun and verb phrases respectively, and
let $\mathrm{NP}_{trans}$ and $\mathrm{VP}_{trans}$ be distinguished versions of those symbols for ergative/nominative
marking. Now, in any language with Verb-Subject-Object
order, there will be a rule $S \to \mathrm{NP}\ \mathrm{VP}$. In an ergative language, there is additionally
a rule of the form $S \to \mathrm{NP}_{trans}\ \mathrm{VP}$, and in a nominative language
a rule of the form $S \to \mathrm{NP}\ \mathrm{VP}_{trans}$.

In each of these cases, a pattern holds: for every parameter, there is some
finite set of context-free productions in the native grammar,
from which exactly one must be selected as an element of the learned grammar.
This leads very naturally to the following development of principled parametric
context-free grammars as a formalization of the Principles and Parameters model.

3.2 Construction 

Definition 6. An n-principled, k-parametric context-free grammar
($(n, k)$-PPCFG) $\Gamma$ is a 4-tuple $(V, \Sigma, \Pi, S)$, where:

1. $V$ is a finite alphabet of nonterminal symbols

2. $\Sigma$ is a finite alphabet of terminal symbols

3. $\Pi$ is a set of $n$ production groups of the form

$$(A_{i,1} \to \alpha_{i,1}), \ldots, (A_{i,j} \to \alpha_{i,j}), \ldots, (A_{i,k} \to \alpha_{i,k})$$

where each $\alpha_{i,j} \in (V \cup \Sigma)^*$, i.e. is a finite sequence of terminals and nonterminals.
Let $\pi_{i,j}$ denote the production $(A_{i,j} \to \alpha_{i,j})$.

4. $S \in V$ is the start symbol.

Definition 7. A parameter setting $p = (p_1, p_2, \ldots, p_n)$ is a sequence of
length $n$, with each $p_i \in 1..k$. Then define $\Gamma_p$ to be the ordinary context-free
grammar $(V, \Sigma, R, S)$ with $R = \{\pi_{i,p_i} : i \in 1..n\}$.

As usual, let $L(G)$ denote the context-free language represented by the CFG
$G$. Then let $\Lambda(\Gamma) = \{L(G) : \exists p : G = \Gamma_p\}$.
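The construction in Definitions 6 and 7 can be sketched as a small data structure: a list of production groups, plus a function that picks one production per group according to a parameter setting. The type aliases, the function names `instantiate` and `all_settings`, and the toy pro-drop grammar are all illustrative assumptions rather than part of the formal definition.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

# A production is (left-hand side, tuple of right-hand-side symbols).
Production = Tuple[str, Tuple[str, ...]]
# A production group (one "principle") lists k alternative productions.
Group = Sequence[Production]


def instantiate(groups: Sequence[Group], p: Sequence[int]) -> List[Production]:
    """Return the rule set R selected by the parameter setting p (entries in 1..k)."""
    if len(p) != len(groups):
        raise ValueError("need exactly one parameter per production group")
    return [group[choice - 1] for group, choice in zip(groups, p)]


def all_settings(n: int, k: int) -> Iterator[Tuple[int, ...]]:
    """Enumerate all k**n parameter settings of an (n, k)-PPCFG."""
    return product(range(1, k + 1), repeat=n)


# Hypothetical toy grammar: the second group models a pro-drop style choice.
toy_groups: List[Group] = [
    [("S", ("PNP", "VP")), ("S", ("PNP", "VP"))],   # unambiguous rule, padded to k = 2
    [("PNP", ("he",)), ("PNP", ())],                # overt pronoun vs. empty pronoun
]
rules = instantiate(toy_groups, (1, 2))
```

Enumerating `all_settings(n, k)` makes the size of the hypothesis space explicit: there are $k^n$ settings, although distinct settings may yield the same language.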

Definition 8. An algorithm $A$ learns the PPCFGs from an equivalence
oracle if $\forall$ PPCFGs $\Gamma$ and languages $\ell \in \Lambda(\Gamma)$, after a finite number of oracle
queries, $A$ outputs some $p$ such that $L(\Gamma_p) = \ell$, or determines that no such $p$
exists.

Definition 9. A efficiently learns the PPCFGs from an equivalence oracle if 
the number of oracle queries it makes is bounded by some polynomial function 
poly(n,k). 

Definition 10. Finally, a principled parametric context-sensitive gram- 
mar is defined exactly as above, with corresponding learning definitions, but 
with context-sensitive productions in each production group. 

3.3 Equivalence 

Some useful facts about the PPCFGs: 

Observation. A "heterogeneous PPCFG" with a variable number of right hand
sides can be transformed into a "homogeneous PPCFG" of the kind described
above by "padding" out the shorter principles with duplicate rules (i.e. to
insert an unambiguous production $A \to \alpha$ into an $(n, 2)$-PPCFG, add to $\Pi$ the
production group $(A \to \alpha), (A \to \alpha)$).

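Observation. An $(n, k)$-PPCFG can be converted into an $(n(k - 1), 2)$-PPCFG
as follows: replace each principle

$$A \to (\alpha_1, \alpha_2, \ldots, \alpha_k)$$

with a set of principles

$$A_1 \to (\alpha_1, A_2)$$
$$A_2 \to (\alpha_2, A_3)$$
$$\vdots$$
$$A_{k-1} \to (\alpha_{k-1}, \alpha_k)$$

where $A_1 = A$ and $A_2, \ldots, A_{k-1}$ are fresh nonterminals.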
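The conversion just described can be sketched as follows, for a single principle whose $k$ alternatives share one left-hand side. The function name `binarize` and the naming scheme for fresh nonterminals (`A_2`, `A_3`, ...) are assumptions made for illustration; the sketch also assumes those fresh names do not already occur in the grammar.

```python
from typing import List, Sequence, Tuple

Production = Tuple[str, Tuple[str, ...]]


def binarize(lhs: str, alternatives: Sequence[Tuple[str, ...]]) -> List[List[Production]]:
    """Turn one k-way principle into k - 1 two-way principles."""
    k = len(alternatives)
    if k == 1:
        # Pad an unambiguous production with a duplicate of itself.
        return [[(lhs, alternatives[0]), (lhs, alternatives[0])]]
    # A_1 is the original nonterminal; A_2 .. A_{k-1} are fresh.
    names = [lhs] + [f"{lhs}_{j}" for j in range(2, k)]
    groups: List[List[Production]] = []
    for j in range(k - 1):
        if j < k - 2:
            second: Tuple[str, ...] = (names[j + 1],)  # defer the choice to the next group
        else:
            second = alternatives[k - 1]               # last group picks alpha_{k-1} or alpha_k
        groups.append([(names[j], alternatives[j]), (names[j], second)])
    return groups


three_way = binarize("A", [("a",), ("b",), ("c",)])
```

Each original principle yields $k - 1$ binary groups, which matches the $n(k - 1)$ count in the observation.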
Thus without loss of generality we may treat every PPCFG as an $(n, 2)$-PPCFG.
The conversion above results in only a polynomial increase in the
number of principles, so any algorithm which is polynomial in $n$, and which
assumes $k = 2$, can be used to solve $k > 2$ with only a polynomial increase in
running time. This also means that we may specify an individual language in a
PPCFG by a bit string of length $n$.

Finally, note that an $(n, k)$-PPCFG contains at most $k^n$ languages.

4 Generic hardness results for PPCFGs 

We will construct a minimal adequate teacher $T$ consisting of two oracles $EQ$
(an equivalence oracle) and $M$ (a membership oracle), such that any algorithm
$A$ requires an exponential number of queries to identify the correct parameter
setting $p$ from a PPCFG $\Gamma$.

Theorem 1. Unconditionally, there exists no algorithm $A$ capable of learning
the PPCFGs from equivalence queries and membership queries in polynomial
time.

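Proof. Fix some number $N$. Construct the PPCFG $\Gamma$ with

$$V = \{X_i : i \in 1..N\} \cup \{\mathrm{START}\}$$
$$S = \mathrm{START}$$
$$\Sigma = \{0, 1\}$$

and $\Pi$ defined as follows:

$$(\mathrm{START} \to X_1 X_2 \cdots X_N)$$
$$(X_k \to 0), (X_k \to 1) \quad \forall k \in 1..N$$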

Every parameter setting $p$ in this grammar allows it to derive precisely one
string: every production is deterministic. Consequently, the $2^N$ possible settings
of the grammar derive $2^N$ unique strings. Given some algorithm $A$ for learning
PPCFGs, the procedure specified below describes an adversarial distinguisher
for this PPCFG which forces the learner to make a total of $2^N - 1$ queries.

After each query, the number of grammars still possible given the evidence
provided so far decreases by at most 1 (because each grammar is capable of
producing only one string), so only after $2^N - 1$ queries of either kind must the oracle
output true.

Thus, only after $2^N - 1$ queries (superpolynomial in $|\Gamma|$ and the length of
the longest production) can the learner halt, so the grammar is not efficiently
learnable from membership and equivalence queries. ∎



    i ← 0
    while i < 2^N − 1 do
        on query EQ(Γ')
            if Γ' has not been previously queried then
                i ← i + 1
            end if
            return FALSE, L(Γ')        ▷ L(Γ') contains only one string
        end query
        on query M(w)
            if w has not been previously queried then
                i ← i + 1
            end if
            return FALSE
        end query
    end while
    on query
        return TRUE                    ▷ Only one language is consistent with the evidence
    end query
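The adversarial teacher can be simulated directly. In the sketch below each candidate grammar is identified with the single binary string of length $N$ that it derives, which is an assumption made to keep the code short; the class `AdversarialTeacher` and the function `brute_force_learner` are hypothetical names. Unlike the counter-based procedure above, the sketch tracks the set of still-consistent candidates explicitly, so that its answers are guaranteed to remain consistent with at least one grammar.

```python
from itertools import product
from typing import Optional, Tuple


class AdversarialTeacher:
    """Answers FALSE for as long as more than one candidate string remains."""

    def __init__(self, n: int) -> None:
        self.candidates = {"".join(bits) for bits in product("01", repeat=n)}
        self.false_answers = 0

    def _can_deny(self, w: str) -> bool:
        # Eliminate w if doing so still leaves some consistent candidate.
        if w in self.candidates and len(self.candidates) > 1:
            self.candidates.discard(w)
            return True
        # Strings already eliminated (or never possible) are truthfully denied.
        return w not in self.candidates

    def equivalence(self, w: str) -> Tuple[bool, Optional[str]]:
        """w is the single string derived by the hypothesis grammar."""
        if self._can_deny(w):
            self.false_answers += 1
            return False, w  # w itself lies in the symmetric difference
        return True, None

    def membership(self, w: str) -> bool:
        if self._can_deny(w):
            self.false_answers += 1
            return False
        return True


def brute_force_learner(n: int, teacher: AdversarialTeacher) -> Tuple[str, int]:
    """Try every hypothesis in turn; return the accepted string and the query count."""
    queries = 0
    for bits in product("01", repeat=n):
        w = "".join(bits)
        queries += 1
        accepted, _ = teacher.equivalence(w)
        if accepted:
            return w, queries
    raise RuntimeError("teacher rejected every hypothesis")


teacher = AdversarialTeacher(3)
answer, total_queries = brute_force_learner(3, teacher)
```

With $N = 3$ the learner receives $2^3 - 1 = 7$ negative answers before its eighth query is accepted, which matches the bound in the proof.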



5 Complexity-theoretic hardness results for PPCFGs

We will construct a reduction from 3SAT to PPCFG learning. Let $X = \{x_i\}$
be a set of variables and $C = \{c_i\}$ be a set of clauses. Let us write $x_j \in c_i$ if
the $i$th clause is satisfied by the $j$th variable, and $\bar{x}_j \in c_i$ if the $i$th clause is
satisfied by the negation of the $j$th variable.

Then construct the PPCFG $\Gamma$ with $V = X \cup \{\mathrm{START}\}$ (together with the marked
copies $x_{i,T}$ and $x_{i,F}$ of each variable), $\Sigma = C$, $S = \mathrm{START}$,
and $\Pi$ with the following production groups:

$$(\mathrm{START} \to x_1 x_2 \cdots x_n)$$
$$(x_i \to x_{i,T}), (x_i \to x_{i,F}) \quad \forall x_i \in X$$
$$(x_i \to \varepsilon) \quad \forall x_i \in X$$
$$(x_{j,T} \to c_i) \quad \forall c_i \in C, \forall x_j \in c_i$$
$$(x_{j,F} \to c_i) \quad \forall c_i \in C, \forall \bar{x}_j \in c_i$$

Note that only for production groups of the form $(x_i \to x_{i,T}), (x_i \to x_{i,F})$
does the parameter setting change the resulting language. These groups may
be thought of as assigning truth values to the variables.

Proposition 1. If there exists some $\ell \in \Lambda(\Gamma)$ such that $\forall c_i \in C : c_i \in \ell$, then the
3SAT instance is satisfiable.

Proof. Set $x_i$ true if the rule $x_i \to x_{i,T}$ is chosen, and false otherwise. For any
$c_i$ in the language, there is a derivation $\mathrm{START} \Rightarrow^* c_i$ of the following form:

$$\mathrm{START} \Rightarrow x_1 x_2 \cdots x_n \Rightarrow^* x_j \Rightarrow x_{j,a} \Rightarrow c_i$$

where $a \in \{T, F\}$ is the truth value assigned to $x_j$. Then $x_j$ satisfies $c_i$. ∎



Proposition 2. If the 3SAT instance is satisfiable, there exists some $\ell \in \Lambda(\Gamma)$ such
that $\forall c_i \in C : c_i \in \ell$.

Proof. Choose the rule $(x_i \to x_{i,T})$ if $x_i$ is set true in the satisfying assignment,
and $(x_i \to x_{i,F})$ if $x_i$ is set false. These settings determine $\ell$. Then, consider
any string $c_i$. There is some variable $x_j$ with truth value $a$ which satisfies the
corresponding clause; then by assignment the grammar contains a production of the form
$x_j \to x_{j,a}$, and by definition contains a production of the form $x_{j,a} \to c_i$, so a
derivation identical to the one in the previous proposition must exist. ∎
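The correspondence between truth assignments and parameter settings in Propositions 1 and 2 can be checked by brute force on small instances. The sketch below is an illustration only: clauses are written as tuples of DIMACS-style signed integers (an assumption, with `+j` for the $j$th variable and `-j` for its negation), and `build_groups`, `derivable_clauses` and `satisfiable_via_grammar` are hypothetical names. The search over settings is exponential, so this checks the reduction rather than providing an efficient learner.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Set, Tuple

Clause = Tuple[int, ...]                      # +j is x_j, -j is the negation of x_j
Production = Tuple[str, Tuple[str, ...]]


def build_groups(n_vars: int, clauses: Sequence[Clause]) -> List[List[Production]]:
    """Production groups of the reduction, one inner list per group."""
    groups: List[List[Production]] = [
        [("START", tuple(f"x{j}" for j in range(1, n_vars + 1)))]
    ]
    for j in range(1, n_vars + 1):
        groups.append([(f"x{j}", (f"x{j},T",)), (f"x{j}", (f"x{j},F",))])  # truth value
        groups.append([(f"x{j}", ())])                                      # x_j -> epsilon
    for i, clause in enumerate(clauses, start=1):
        for lit in clause:
            mark = "T" if lit > 0 else "F"
            groups.append([(f"x{abs(lit)},{mark}", (f"c{i}",))])
    return groups


def derivable_clauses(clauses: Sequence[Clause], assignment: Dict[int, bool]) -> Set[int]:
    """Indices i for which the one-symbol string c_i is derivable under the setting."""
    out: Set[int] = set()
    for i, clause in enumerate(clauses, start=1):
        for lit in clause:
            # START => x_1 ... x_n =>* x_j => x_{j,a} => c_i needs x_j's value to match lit.
            if assignment[abs(lit)] == (lit > 0):
                out.add(i)
    return out


def satisfiable_via_grammar(n_vars: int, clauses: Sequence[Clause]) -> Optional[Dict[int, bool]]:
    """Search all parameter settings for one whose language contains every clause."""
    wanted = set(range(1, len(clauses) + 1))
    for bits in product([True, False], repeat=n_vars):
        assignment = dict(zip(range(1, n_vars + 1), bits))
        if derivable_clauses(clauses, assignment) == wanted:
            return assignment
    return None


example = [(1, 2, -3), (-1, 2, 3), (1, -2, -3)]
setting = satisfiable_via_grammar(3, example)
```

A setting is returned exactly when some truth assignment satisfies every clause, which is the content of the two propositions taken together.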

Theorem 2. If P ≠ NP, no efficient algorithm exists for learning PPCFGs
from positive presentations.

Proof. Assume that there exists some algorithm $A$ which efficiently learns the
PPCFGs from positive presentations. We will use $A$ to construct a SAT solver
$S$ by simulating the oracle. Construct $\Gamma$ from the SAT instance as described
above. Then $S$ interacts with $A$ by presenting the clauses $c_i \in C$ as positive examples.

By assumption, after observing polynomially many positive presentations,
and performing polynomially many computations, $A$ outputs a parameter setting
$p$ which produces every $c_i \in C$, or a signal indicating no such assignment
exists. From Propositions 1 and 2, such a $p$ exists if and only if the SAT instance
is satisfiable. Thus $S$ determines in a polynomial number of steps whether the
SAT instance is satisfiable, and the existence of $A$ implies P = NP. ∎

6 Cryptographic hardness results for PPCSGs 

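We will construct another reduction, this time from integer factorization to
PPCSG learning. Let $N$ be a product of two $(n - 1)$-digit primes.

Let $A$ be a set of non-terminal symbols $A_0..A_{\lceil \lg \sqrt{N} \rceil}$, and $B$, $C$, $Z$ be similar
sets of nonterminals of cardinality $\lceil \lg N \rceil + 1$. Then construct the PPCSG $\Gamma$
with

$$V = A \cup B \cup C \cup Z \cup \{S\}$$
$$\Sigma = \{c_k\} \quad \forall k \in 0..\lceil \lg N \rceil$$

start symbol $S$, and $\Pi$ with the following production groups:

$$(S \to A_{\lceil \lg \sqrt{N} \rceil}\, S)$$
$$(S \to \varepsilon)$$
$$(A_0 \to B_0), (A_0 \to \varepsilon)$$
$$(A_j \to A_{j-1}), (A_j \to B_j A_{j-1}) \quad \forall j \in 1..\lceil \lg \sqrt{N} \rceil$$
$$(B_j \to B_{j-1} B_{j-1}) \quad \forall j \in 1..\lceil \lg N \rceil$$
$$(B_0 \to C_0)$$
$$(C_k C_k \to C_k Z_k) \quad \forall k \in 0..\lceil \lg N \rceil$$
$$(C_{k+1} Z_k \to C_{k+1}) \quad \forall k \in 0..\lceil \lg N \rceil$$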



Intuitively, the parameter settings in this grammar $(A_j \to A_{j-1}), (A_j \to B_j A_{j-1})$
fix some number $m$ between 1 and $\sqrt{N}$. Each $A_{\lceil \lg \sqrt{N} \rceil} \Rightarrow^* B_0^m$, so
$S \Rightarrow^* C_0^{mk}$ for all $k$, i.e. the unary representation of all multiples of $m$. This
unary string may then be collapsed into a $\lceil \lg N \rceil$-ary representation as a string
of terminal $c_i$s.

Let $\ell$ be the language consisting of the single string $w$, where $w$ is the concatenation
of every $c_i$ such that the $i$th digit of the binary representation of $N$
is 1.

Given a parameter setting $s$ for $\Gamma$, for each production group $(A_j \to A_{j-1})$,
$(A_j \to B_j A_{j-1})$ in $s$, let $p_j = 0$ if the first setting is chosen and 1 if the second
setting is chosen. Let $P_s$ be the number whose binary representation is given by
the $p_j$s. Alternatively, given a binary number $P$ let $s_P$ be the parameter setting
induced by $P$'s bits.
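The arithmetic behind this encoding can be sketched without building the grammar. In the code below a parameter setting is a list of bits with the least significant bit first, and bit 0 is taken to mean that $A_0$ produces $B_0$ rather than the empty string; both conventions are assumptions, as are the function names. The check in `derives_w` relies on the two propositions that follow, which together say that $w$ is derivable exactly when the encoded number divides $N$.

```python
from typing import List


def setting_to_number(bits: List[int]) -> int:
    """P_s: the number whose binary digits are the parameter bits (LSB first)."""
    return sum(b << j for j, b in enumerate(bits))


def number_to_setting(P: int, width: int) -> List[int]:
    """s_P: the parameter setting induced by the bits of P."""
    return [(P >> j) & 1 for j in range(width)]


def target_string(N: int) -> List[int]:
    """Indices i of the terminals c_i in w: the positions of the 1-bits of N."""
    return [i for i in range(N.bit_length()) if (N >> i) & 1]


def derives_w(bits: List[int], N: int) -> bool:
    """True when the unary string C_0^N is a multiple of the encoded number."""
    P = setting_to_number(bits)
    return P > 0 and N % P == 0


bits_for_five = number_to_setting(5, 3)
```

For example, with $N = 35$ the setting encoding 5 derives $w$ while the setting encoding 6 does not.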

Finally, some notation: given a sequence of strings $S$, let $\|_{s_i \in S}\, s_i$ denote the
concatenation of all $s_i$s.

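Proposition 3. Given numbers $P$ and $Q$, $P < Q$, if $PQ = N$ then $w \in L(\Gamma_{s_P})$.

Proof. In $\Gamma_{s_P}$,

$$S \Rightarrow^* A_{\lceil \lg \sqrt{N} \rceil}^{Q} \Rightarrow^* \Big( \big\|_{0 \le i \le \lceil \lg \sqrt{N} \rceil,\ p_i = 1} B_i \Big)^{Q} \Rightarrow^* B_0^{N} \Rightarrow^* C_0^{N} \Rightarrow^* w$$

∎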
Proposition 4. If $w \in L(\Gamma_s)$, then there exists $Q$ such that $P_s Q = N$.

Proof. Certainly if $w \in L(\Gamma_s)$, then $C_0^N \Rightarrow^* w$. But $S \Rightarrow^* C_0^{P_s k}$ for all $k$ (using the
derivation in Proposition 3); then there exists some $Q$ such that $P_s Q = N$. ∎

Theorem 3. If integer factorization is hard, no efficient algorithm exists for
learning random PPCSGs with non-negligible probability from positive
presentations.

Proof. Assume that there exists some algorithm $A$ which, given $\Gamma$ and the
positive presentation of the single string $w$ as specified above, outputs a parameter
setting $P$ for $\Gamma$ such that $w \in L(\Gamma_P)$ with non-negligible probability in a polynomial
number of computations. Then we will construct a factorizer $F$ that
decomposes $N$ into $P$ and $Q$.

From the preceding propositions, if an acceptable $P$ is found then $PQ = N$
for some $Q$, so if $A$ can find a parameter setting in polynomial time then this
algorithm finds a factorization in polynomial time. ∎
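The factorizer described in the proof can be sketched as a thin wrapper around a learner. Everything here is a hypothetical illustration: `factorize_with_learner` decodes the learner's parameter bits (least significant bit first, an assumed convention) into a candidate factor, and `brute_force_learner` is an exponential-time stand-in for the assumed efficient algorithm, included only so that the sketch runs. The wrapper rejects the trivial answer 1, a case the proof above does not discuss.

```python
from math import isqrt
from typing import Callable, List, Optional, Tuple


def factorize_with_learner(
    N: int, learner: Callable[[int], Optional[List[int]]]
) -> Optional[Tuple[int, int]]:
    """Decode the learner's parameter setting into a factorization of N, if any."""
    bits = learner(N)
    if bits is None:
        return None
    P = sum(b << j for j, b in enumerate(bits))
    if P <= 1 or N % P != 0:
        return None  # reject trivial or incorrect settings
    return P, N // P


def brute_force_learner(N: int) -> Optional[List[int]]:
    """Stand-in learner: tries every candidate up to sqrt(N), so not efficient."""
    width = isqrt(N).bit_length()
    for P in range(2, isqrt(N) + 1):
        if N % P == 0:
            return [(P >> j) & 1 for j in range(width)]
    return None


factors = factorize_with_learner(35, brute_force_learner)
```

Any learner that returned an acceptable setting in polynomial time could be substituted for the stand-in, which is the content of the reduction.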

This final proof is neither particularly interesting nor satisfying: even the
task of finding a derivation in a CSG is known to be PSPACE-complete (though
it's easy to see that a polynomial-time parsing algorithm for this particular
family of grammars exists). Note that the only context-sensitive production
groups employed in this construction are used to guarantee a compact encoding
of $w$; we suspect that there is an alternative way of constructing this "grammar
arithmetic" that requires only weaker rules, perhaps mildly context-sensitive or
even context-free. We thus close with the following:

Open Problem. If integer factorization is hard, does there exist a polynomial- 
time algorithm for learning random PPCFGs with non-negligible probability from 
positive presentations? 

7 Conclusion 

We have introduced a new formalism, the principled parametric context-free (also
context-sensitive) grammars, as a model of the "Principles and Parameters"
framework in psycholinguistics, and presented three hardness-of-learning results for
the class of PPCFGs and PPCSGs. While these results certainly do not demonstrate
definitively that learning under the Principles and Parameters framework
is completely impossible (all that is required for human language learning to be
possible is that one PPCFG be efficiently learnable), we have shown that there
is likely no generic algorithm for learning a class of PPCFGs given either equivalence
and membership queries or a positive presentation. In general, these results
show that even radically restricting the class of candidate grammars does not
guarantee a successful outcome when attempting to learn CFGs and CSGs.